Abstract: This paper studies optimization of zero-delay
source-channel codes, and specifically the problem of obtaining 
globally optimal transformations that map between the source 
space and the channel space, under a given transmission power 
constraint and for the mean square error distortion. Particularly, 
we focus on the setting where the decoder has access to side 
information, whose cost surface is known to be riddled with 
local minima. Prior work derived the necessary conditions for 
optimality of the encoder and decoder mappings, along with 
a greedy optimization algorithm that imposes these conditions 
iteratively, in conjunction with the heuristic "noisy channel 
relaxation" method to mitigate poor local minima. While noisy 
channel relaxation is arguably effective in simple settings, it 
fails to provide accurate global optimization results in more 
complicated settings including the decoder with side information 
as considered in this paper. We propose a global optimization 
algorithm based on the ideas of "deterministic annealing", a
non-convex optimization method derived from information theoretic
principles with analogies to statistical physics, and successfully
employed in several problems including clustering, vector
quantization and regression. We present comparative numerical results
that show strict superiority of the proposed algorithm over greedy 
optimization methods as well as over the noisy channel relaxation. 



I. Introduction 

The zero-delay source-channel coding problem has recently
gained renewed interest [1]-[5]. In this paper, we focus on
numerical optimization of the zero-delay mappings. In prior
work [6], the "noisy channel relaxation" (NCR) method [7],
[8] was employed to mitigate the poor local minima problem
inherent to such optimization problems. While NCR is rela- 
tively successful in the point-to-point setting, it is insufficient 
to obtain precise results in more involved settings such as the 
decoder side information setting. In this paper, we incorporate 
a powerful non-convex optimization method, deterministic 
annealing, within a framework proposed in our prior work 
[6] to numerically obtain the globally optimal zero-delay
mappings in the side information setting. 

Deterministic annealing (DA) is a global optimization ap- 
proach, based on information theoretic principles with analo- 
gies to statistical physics, that has been successfully used as 
a remedy to the problem of poor local minima in non-convex 
optimization problems, including clustering [9], vector
quantization [10], regression [11] and more (see review in [12]).
An important distinction between DA and other non-convex 
optimization tools such as NCR is that DA is independent of 
the initialization. 
Fig. 1. The problem setting.

This paper is organized as follows. In Section II, we present 
preliminaries and the problem definition. In Section III, we 
review prior work including the necessary conditions for 
optimality, and optimization aided by NCR. In Section IV, 
we describe the proposed algorithm. Numerical comparisons 
are presented in Section V and concluding remarks in Section 
VI. 

II. Preliminaries and Problem Definition 

Let $\mathbb{E}(\cdot)$, $\mathbb{P}(\cdot)$ and $\mathbb{R}$ denote the
expectation and probability operators, and the set of real numbers,
respectively. Let $\nabla$ and $\nabla_x$ denote the gradient and the
partial gradient with respect to $x$, respectively. Let
$f'(x) = \frac{df(x)}{dx}$ denote the first order derivative of the
function $f(\cdot)$. The joint Gaussian density with mean $\mu$ and
covariance matrix $R$ is denoted as $\mathcal{N}(\mu, R)$. All
the logarithms in the paper are natural logarithms and may
in general be complex. The integrals are in general Lebesgue
integrals. While we focus on scalar sources and noises, our
results can easily be extended to vector spaces, albeit with
more involved notation.

The problem setting is given in Figure 1, where the source
$X \in \mathbb{R}$ and the side information $Z \in \mathbb{R}$ are drawn
from the joint density $f_{X,Z}(\cdot,\cdot)$. $Z$ is available only to
the decoder, while $X$ is mapped to the channel input by the encoding
function $g : \mathbb{R} \to \mathbb{R}$ and transmitted over the channel
whose additive noise $N \in \mathbb{R}$, with density $f_N(\cdot)$, is
independent of $X, Z$. The received channel output $Y = g(X) + N$ is
mapped to the estimate $\hat{X}$ by the decoding function
$w : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$. The problem is to find
optimal mapping functions $g(\cdot), w(\cdot,\cdot)$ that
minimize the mean squared error (MSE) distortion



$$D = \mathbb{E}\{(X - \hat{X})^2\}, \tag{1}$$

subject to

$$P(g) = \mathbb{E}\{g^2(X)\} \le P. \tag{2}$$



Although the problem we consider is delay limited, it is 
insightful to consider asymptotic bounds achievable at infinite 
delay. From Shannon's source and channel coding theorems, it 
is known that, asymptotically, the source can be compressed to 
R(D) bits (per source sample) at distortion level D, and that 
C bits can be transmitted over the channel (per channel use) 
with arbitrarily low probability of error, where $R(D)$ is the
source rate-distortion function and $C$ is the channel capacity
(see e.g. [13]). The asymptotically optimal coding scheme is
the tandem combination of the optimal source and channel
coding schemes, hence $R(D) \le C$ must hold. By setting



$$R(D) = C, \tag{3}$$



one obtains a lower bound on the distortion of any source-channel
coding scheme. The capacity of the additive white
Gaussian noise channel with noise variance $\sigma_N^2$ is given by



$$C = \frac{1}{2}\log\left(1 + \frac{P}{\sigma_N^2}\right), \tag{4}$$



where $P$ is the transmission power constraint and $\sigma_N^2$ is
the noise variance. For source coding with decoder side
information, it has been established for Gaussians and MSE
distortion that there is no rate loss due to the fact that the
side information is unavailable to the encoder [14]. Hence, the
optimum performance theoretically attainable (OPTA) can be
obtained by equating the conditional rate-distortion function of
the source (given the side information) to the channel capacity.
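The rate-distortion function of $X$ when $Z$ serves as side
information and $[X, Z] \sim \mathcal{N}(0, R)$, where
$R = \sigma_X^2 \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$
with $|\rho| < 1$, is:

$$R_{X|Z}(D) = \max\left(0, \frac{1}{2}\log\frac{(1-\rho^2)\sigma_X^2}{D}\right). \tag{5}$$

We plug (4) and (5) in (3) to obtain OPTA:

$$D_{OPTA} = \frac{(1-\rho^2)\sigma_X^2}{1 + P/\sigma_N^2}. \tag{6}$$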
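As a quick numerical illustration, the OPTA benchmark can be evaluated directly. The sketch below assumes the jointly Gaussian setting described above, for which the conditional rate-distortion function is $\frac{1}{2}\log\frac{(1-\rho^2)\sigma_X^2}{D}$ for $D \le (1-\rho^2)\sigma_X^2$, and equates it to the capacity in (4). The function and parameter names (`opta_distortion`, `sigma_x2`, `power`, `sigma_n2`) are illustrative choices of ours, not notation from the paper.

```python
import math


def opta_distortion(sigma_x2: float, rho: float, power: float, sigma_n2: float) -> float:
    """Distortion obtained by equating the Gaussian conditional
    rate-distortion function to the AWGN channel capacity."""
    if not abs(rho) < 1:
        raise ValueError("rho must satisfy |rho| < 1")
    # Variance of X once the side information Z is known.
    conditional_variance = (1.0 - rho ** 2) * sigma_x2
    return conditional_variance / (1.0 + power / sigma_n2)


def to_db(ratio: float) -> float:
    """Convert a power ratio to decibels."""
    return 10.0 * math.log10(ratio)


if __name__ == "__main__":
    # Example values chosen for illustration only.
    d = opta_distortion(sigma_x2=1.0, rho=0.99, power=10.0, sigma_n2=1.0)
    print(f"OPTA distortion: {d:.6f}, SNR: {to_db(1.0 / d):.2f} dB")
```

Since OPTA is asymptotic, any zero-delay scheme evaluated at the same power should be expected to sit above this distortion value.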
III. Prior Work



Here, we summarize the relevant contributions of prior
work; see [6] for more details.

A. Necessary Conditions for Optimality 

Let the encoder $g(\cdot)$ be fixed. Then, the optimal decoder is
the MSE estimator of $X$ given $Z = z$ and $Y = y$:

$$w(y,z) = \mathbb{E}\{X \mid y, z\}. \tag{7}$$

Plugging the expressions for expectation, applying Bayes'
rule and noting that $f_{Y|X}(y|x) = f_N(y - g(x))$, the optimal
decoder can be written, in terms of known quantities, as

$$w(y,z) = \frac{\int x\, f_{X,Z}(x,z)\, f_N(y - g(x))\, dx}{\int f_{X,Z}(x,z)\, f_N(y - g(x))\, dx}. \tag{8}$$
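The decoder in (8) is a ratio of two integrals over $x$, so for a given encoder it can be approximated on a grid. The following is a minimal sketch under assumptions of ours: a zero-mean, unit-variance jointly Gaussian pair $(X, Z)$ with correlation `rho`, Gaussian noise with variance `noise_var`, and a uniform grid `x_grid` (so the grid spacing cancels in the ratio). The helper names are hypothetical, and NumPy is used for the arithmetic.

```python
import numpy as np


def gaussian_pdf(t, var):
    """Zero-mean Gaussian density with variance var."""
    return np.exp(-t ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def joint_pdf(x, z, rho):
    """Zero-mean, unit-variance bivariate Gaussian density with correlation rho."""
    norm = 2.0 * np.pi * np.sqrt(1.0 - rho ** 2)
    quad = (x ** 2 - 2.0 * rho * x * z + z ** 2) / (1.0 - rho ** 2)
    return np.exp(-0.5 * quad) / norm


def optimal_decoder(y, z, g, rho, noise_var, x_grid):
    """Approximate w(y, z) = E{X | y, z} by Riemann sums over a uniform x_grid."""
    # Integrand shared by numerator and denominator: f_{X,Z}(x, z) * f_N(y - g(x)).
    weights = joint_pdf(x_grid, z, rho) * gaussian_pdf(y - g(x_grid), noise_var)
    return np.sum(x_grid * weights) / np.sum(weights)


if __name__ == "__main__":
    x_grid = np.linspace(-8.0, 8.0, 4001)
    # A linear encoder is used here only because its MMSE decoder is known in closed form.
    w = optimal_decoder(y=1.0, z=0.4, g=lambda x: 2.0 * x,
                        rho=0.5, noise_var=1.0, x_grid=x_grid)
    print(w)
```

For a linear encoder the result can be checked against the standard Gaussian conditional-mean formula, which is a convenient sanity test before plugging in a nonlinear $g(\cdot)$.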

To derive the necessary condition for optimality of $g(\cdot)$, we
consider the distortion functional

$$D[g,w] = \mathbb{E}\{(X - w(g(X) + N, Z))^2\}, \tag{9}$$

and construct the Lagrangian cost functional:

$$J[g,w] = D[g,w] + \lambda P[g]. \tag{10}$$

Now, let us assume the decoder $w(\cdot,\cdot)$ is fixed. To obtain
necessary conditions, we apply the standard method in variational
calculus:

$$\nabla_g J[g,w] = 0, \quad \forall x, \tag{11}$$

where, omitting a common factor of 2,

$$\nabla_g J[g,w] = \lambda f_X(x) g(x) - \iint w'(g(x)+n, z)\,[x - w(g(x)+n, z)]\, f_N(n) f_{X,Z}(x,z)\, dn\, dz, \tag{12}$$

and $w'(\cdot,\cdot)$ denotes the derivative of $w$ with respect to
its first argument.

Remark 1: Note that the linear encoder and decoder mappings
satisfy the necessary conditions for optimality in the
Gaussian case. However, it is well known that linear mappings
are highly suboptimal, see e.g. [6]. This fact illustrates the
existence of poor local optima and the challenges facing
algorithms based on these necessary conditions.

B. Greedy Algorithm

Iteratively alternating between the imposition of the individual
necessary conditions for optimality successively decreases
the Lagrangian cost until a stationary point is reached. Imposing
the decoder optimality condition is straightforward, since it is
expressed in closed form as a functional of the encoding mapping
$g(\cdot)$. The encoder optimality condition is not in closed form,
so we perform a steepest descent search. The encoder
is updated as given below, where $i$ is the iteration index and
$\mu$ is the step size:

$$g_{i+1}(x) = g_i(x) - \mu \nabla_g J[g_i, w]. \tag{13}$$



At each iteration $i$, the total cost decreases monotonically, and
iterations continue until convergence.

There is no guarantee that an iterative descent algorithm
of this type will converge to the globally optimal solution;
in fact, simulations show severe issues with local optima. As a
remedy, the NCR method of [7], [8] was embedded in the iterative
algorithm in [6]: the algorithm was first run for a very noisy
channel (high Lagrangian parameter $\lambda$), and $\lambda$ was
then gradually decreased while using the previous mapping solution
as the initial condition.

IV. Proposed Method 

We recast the zero-delay source-channel coding problem
as a regression problem optimizing for the encoding function
within a given parametric class of functions. We restrict the
discussion to piecewise regression functions, which approximate
the desired mappings by partitioning the space and
matching a simple local model to each region. Such regression
functions are determined by specifying two components: a
space partition and a parametric local model per partition
cell (typically a simple model such as constant, linear, or
Gaussian)$^1$.

Fig. 2. An example encoder function consisting of affine local models, K = 4.

DA introduces controlled randomization into the optimiza- 
tion process. The problem is recast as minimization of the 
expected cost subject to a constraint on the level of randomness 
as measured by the Shannon entropy of the system. The 
resulting Lagrangian functional can be viewed as the free 
energy of a corresponding physical system whose Lagrange 
parameter is the "temperature". The minimization is started 
at a high temperature (highly random mappings) where, in 
fact, the entropy is maximized and all points belong equally
to all partition cells (and effectively there is only one local
model). This minimum is then tracked at successively lower 
temperatures (lower levels of entropy) as the system typically 
undergoes a sequence of phase transitions through which the 
model complexity (the number of distinct local models) grows. 
As the temperature approaches zero, the distortion and power 
terms dominate the Lagrangian cost and a hard (nonrandom) 
mapping is obtained. 

We proceed to describe in more detail the proposed DA- 
based method. 

A. Structured Encoder Functions 

We consider the parametric functions (local models)
$g_k(x) = f(x, \Lambda_k)$, for $k \in \{1, \ldots, K\}$, with the
parameter sets $\Lambda_k$. These functions have a certain parametric
form, and each function is defined over a region denoted as
$\mathbb{R}_k$. The overall encoder function is defined as
$g(x) = g_k(x)$ for $x \in \mathbb{R}_k$. The
parametric form is to be chosen appropriately depending on
the involved distributions and the design constraints. Figure 2
shows an example structured encoder with affine local models
of the form $g_k(x) = a_k x + b_k$.
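To make the structured encoder concrete, the sketch below evaluates a piecewise affine mapping with $K$ local models. It assumes, purely for illustration, that the regions are contiguous intervals described by $K-1$ sorted breakpoints; the paper does not require this particular parameterization of the partition, and the function name and arguments are hypothetical.

```python
import numpy as np


def piecewise_affine_encoder(x, breakpoints, slopes, offsets):
    """Evaluate g(x) = a_k * x + b_k, where k indexes the interval containing x.

    breakpoints: sorted sequence of K - 1 interval boundaries (an assumed,
    illustrative way of describing the regions).
    slopes, offsets: sequences of length K holding a_k and b_k.
    """
    x = np.asarray(x, dtype=float)
    slopes = np.asarray(slopes, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    if len(slopes) != len(breakpoints) + 1 or len(offsets) != len(slopes):
        raise ValueError("need K slopes, K offsets and K - 1 breakpoints")
    # Region index in 0..K-1 for every input point.
    k = np.searchsorted(breakpoints, x, side="right")
    return slopes[k] * x + offsets[k]


if __name__ == "__main__":
    # K = 4 affine pieces forming a sawtooth, so distinct source intervals
    # share the same channel interval.
    g = piecewise_affine_encoder(
        x=[-1.5, -0.5, 0.5, 1.5],
        breakpoints=[-1.0, 0.0, 1.0],
        slopes=[1.0, 1.0, 1.0, 1.0],
        offsets=[2.0, 1.0, -1.0, -2.0],
    )
    print(g)
```

The sawtooth in the usage example is only one choice of parameters; in the method described here the parameters $a_k$, $b_k$ are the quantities being optimized.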

B. Randomized Associations 

We randomize the associations of the input points to the
local models, or regions. We first define the probabilities

$$p_{K|X}(k|x) \triangleq \mathbb{P}\{x \in \mathbb{R}_k\}, \quad \forall k, x. \tag{14}$$

$^1$In this paper, we use only affine models; however, it is straightforward to
include other models within the optimization framework.



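Note that $\sum_{k=1}^{K} p_{K|X}(k|x) = 1, \ \forall x$. Next, we rewrite (1) as

$$D = \sum_{k=1}^{K} \int_{\mathbb{R}} D_k(x)\, p_X(x)\, p_{K|X}(k|x)\, dx, \tag{15}$$

where $D_k(x)$ is the contribution to the distortion when point
$x$ is associated with region $k$. It is given by

$$D_k(x) = \iint d(x, w(g_k(x) + n, z))\, p_N(n)\, p_{Z|X}(z|x)\, dz\, dn, \tag{16}$$

with $d(x, \hat{x}) = (x - \hat{x})^2$ for the MSE distortion in (1).
The power constraint in (2) is rewritten as

$$P = \sum_{k=1}^{K} \int_{\mathbb{R}} g_k^2(x)\, p_X(x)\, p_{K|X}(k|x)\, dx. \tag{17}$$

The cost function to minimize is

$$J = D + \lambda P \tag{18}$$

$$\phantom{J} = \sum_{k=1}^{K} \int_{\mathbb{R}} J_k(x)\, p_X(x)\, p_{K|X}(k|x)\, dx, \tag{19}$$

where

$$J_k(x) \triangleq D_k(x) + \lambda g_k^2(x), \quad \forall k. \tag{20}$$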



We now restate the problem as that of minimizing $J$ over
the local model parameters and association probabilities. Note
that, given the local models, the association probabilities that
minimize (19) will implement 'hard' associations, that is,
every point is associated with probability one to the region
that contributes the minimum cost in (20). Therefore, by
randomizing the encoder we generalize the search space but
preserve the same global minimum as the original problem.

C. Entropy Constraint 

As we noted above, the direct optimization of the associa- 
tion probabilities will result in 'hard' probabilities. However, 
in order to avoid poor local optima we impose and control 
the level of randomness, i.e. we introduce a constraint on 
the randomness of the encoder, which we measure by the 
Shannon entropy. The total entropy of the encoder is given by
$H(X,K) = H(X) + H(K|X)$, and since $H(X)$ is constant
(determined by the source), we define $H = H(K|X)$, where

$$H(K|X) = -\int p_X(x) \sum_{k=1}^{K} p_{K|X}(k|x) \log p_{K|X}(k|x)\, dx. \tag{21}$$
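The conditional entropy that measures the randomness of the encoder is straightforward to approximate numerically. The sketch below assumes the source density and the association probabilities have been sampled on a uniform grid of $M$ points; the function name and array layout are our own choices. It uses the usual convention $0 \log 0 = 0$ so that hard associations give zero entropy.

```python
import numpy as np


def conditional_entropy(p_k_given_x, p_x, dx):
    """Riemann-sum approximation of H(K|X) in nats.

    p_k_given_x: array of shape (K, M); each column sums to one.
    p_x: source density sampled on M uniformly spaced points.
    dx: grid spacing.
    """
    p = np.asarray(p_k_given_x, dtype=float)
    # Apply the convention 0 * log(0) = 0 without evaluating log(0).
    safe_p = np.where(p > 0.0, p, 1.0)
    plogp = np.where(p > 0.0, p * np.log(safe_p), 0.0)
    return -np.sum(p_x * plogp.sum(axis=0)) * dx


if __name__ == "__main__":
    x = np.linspace(-8.0, 8.0, 1601)
    dx = x[1] - x[0]
    p_x = np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)  # standard Gaussian source
    uniform = np.full((2, x.size), 0.5)                  # fully random associations, K = 2
    print(conditional_entropy(uniform, p_x, dx))
```

For $K$ local models the entropy is bounded above by $\log K$, attained by uniform associations; for $K = 2$ this is $\log 2 \approx 0.693$ nats.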
Remark 2: It is important to note that the approach is
generalizable to the "mass-constrained" variant of DA [15], where
entropy maximization is effectively replaced by minimization
of the mutual information $I(K;X)$. Such generalization offers
additional optimization advantages (see [13]), as well as a
useful and direct link to rate-distortion theory (see [16] for
analysis of these connections, as well as DA for rate-distortion
function computation). The corresponding "mass-constrained"
extension for the current problem is a work in progress and is
outside of the scope of this paper.

Fig. 3. The evolution of the encoder in the algorithm. The two models are shown by dotted lines, and the size of a dot gives the association probability of
that input point to that local model. The black line represents the averaged encoder, K = 2. (Recovered panel title: T = 0.0183, J = 0.01483, H(K|X) = 0.69.)

Accordingly, we construct the Lagrangian

$$F = J - TH, \tag{22}$$

to be minimized, with T (temperature) being the Lagrange 
multiplier associated with the entropy constraint. Note that 
for large T, the minimum F is achieved by maximizing 
the entropy. At lower values of T, randomness is traded for 
reduction in J. In the limit T = 0, minimizing F corresponds 
to minimizing J directly, which produces a deterministic 
encoder. Therefore, we start at a high value of T and gradually 
lower it while minimizing F at each step. 

We present an example of the method in Figure 3 with
two local models$^2$. When $T$ is large, the local models are
coincident. As we lower $T$, the system goes through a
bifurcation point (referred to as a "phase transition" in statistical
physics) where the two local models split from each other
to decrease $F$. The corresponding value of $T$ is referred to as
the first "critical temperature". Further phase transitions can
be obtained by keeping a duplicate of each local model at
every temperature. The duplicates will merge at every iteration
until a critical temperature is reached, and will split at a phase
transition.

The pseudocode of the method is given in Algorithm 1.

$^2$The example is run for jointly Gaussian source and side information and
Gaussian noise.

Algorithm 1 The outline of the proposed algorithm
Initialize: high $T$, single region ($K = 1$)
while $H(K|X) > H_{\min}(K|X)$ do
    Duplicate (if $K < K_{\max}$) and perturb local models
    while $\mathrm{cost}_{i+1} < \mathrm{cost}_i$ do
        update the local model parameters
        update $p_{K|X}(k|x)$, $\forall k, x$
        update $w(y, z)$
    end while
    Check if regions have split
    Set $T = \alpha T$    (e.g., $\alpha = 0.95$)
end while

D. Update Equations

The optimum local model parameters cannot be obtained in
closed form, hence we perform a gradient descent search. The
gradient with respect to any local parameter $\theta_k$ from a set
$\Lambda_k$ can be obtained as

$$\frac{\partial F}{\partial \theta_k} = \frac{\partial J}{\partial \theta_k} = \int_{\mathbb{R}} \frac{\partial [D_k(x) + \lambda g_k^2(x)]}{\partial \theta_k}\, p_X(x)\, p_{K|X}(k|x)\, dx. \tag{23}$$

For the affine model, $\theta_k$ denotes $a_k$ and $b_k$.
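For the affine model the integrand of (23) can be written out explicitly with the chain rule. The following worked step is a sketch under assumptions that we state here rather than take from the text: the distortion measure is the squared error $d(x,\hat{x}) = (x-\hat{x})^2$, the decoder $w$ and the association probabilities are held fixed during the parameter update, and $w$ is differentiable in its first argument with derivative $w'$.

```latex
\begin{align*}
\frac{\partial}{\partial \theta_k}\left[ D_k(x) + \lambda g_k^2(x) \right]
  &= 2\,\frac{\partial g_k(x)}{\partial \theta_k}
     \Big[ \lambda g_k(x)
     - \iint w'\big(g_k(x)+n, z\big)\,
       \big(x - w(g_k(x)+n, z)\big)\,
       p_N(n)\, p_{Z|X}(z|x)\, dz\, dn \Big], \\
\frac{\partial g_k(x)}{\partial a_k} &= x, \qquad
\frac{\partial g_k(x)}{\partial b_k} = 1,
\qquad \text{for } g_k(x) = a_k x + b_k .
\end{align*}
```

Up to the factor of 2 and the weighting by $p_X(x)\,p_{K|X}(k|x)$, the bracketed term has the same form as the functional gradient in (12), evaluated at the local model $g_k$.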

The association probabilities that minimize $F$ are derived
in a straightforward fashion as the Gibbs distribution

$$p_{K|X}(k|x) = \frac{e^{-[D_k(x) + \lambda g_k^2(x)]/T}}{\sum_{k'=1}^{K} e^{-[D_{k'}(x) + \lambda g_{k'}^2(x)]/T}}, \quad \forall x. \tag{24}$$

Remark 3: As expected, (24) results in uniform associations
for large $T$ and "hard" (binary) associations in the limit $T \to 0$.

The optimum decoder given the encoder can be derived by
plugging $p(y|x,z) = \sum_{k=1}^{K} p_N(y - g_k(x))\, p_{K|X}(k|x)$ in (8).
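The Gibbs form of the association probabilities is a softmax over the negated per-region costs scaled by the temperature. The sketch below assumes the costs $J_k(x) = D_k(x) + \lambda g_k^2(x)$ have already been computed on a grid and stored in a `(K, M)` array; the function name and layout are ours. Subtracting the per-point minimum cost before exponentiating leaves the ratio unchanged and keeps the computation stable at low temperatures.

```python
import numpy as np


def gibbs_associations(costs, temperature):
    """Association probabilities p(k|x) proportional to exp(-J_k(x) / T).

    costs: array of shape (K, M) holding J_k(x) on M grid points.
    temperature: positive scalar T.
    Returns an array of the same shape whose columns sum to one.
    """
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    costs = np.asarray(costs, dtype=float)
    # Shift by the per-point minimum so the largest exponent is exactly zero.
    shifted = -(costs - costs.min(axis=0, keepdims=True)) / temperature
    weights = np.exp(shifted)
    return weights / weights.sum(axis=0, keepdims=True)


if __name__ == "__main__":
    # Toy costs for K = 2 local models at M = 2 input points (illustrative values).
    costs = np.array([[1.0, 3.0],
                      [2.0, 1.0]])
    for t in (1e6, 1.0, 1e-3):
        print(t, gibbs_associations(costs, t))
```

Running the usage example at decreasing temperatures shows the behavior noted in Remark 3: nearly uniform associations at very high $T$ and essentially one-hot associations at very low $T$.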
Fig. 4. The performance comparison of the proposed method with greedy
optimization, noisy channel relaxation (NCR), and the linear mappings.

Fig. 5. An example mapping for correlation coefficient $\rho = 0.99$ at
CSNR = 10.98, SNR = 23.2.



V. Experimental Results 

The comparative performance results are given in Figure 4
for jointly Gaussian $X$ and $Z$ with $\rho = 0.99$ as described in
Section II, and Gaussian noise with unit variance. The best
results of the NCR method out of multiple runs, and various
results from the greedy method, are presented in Figure 4. Note
that the proposed method is independent of the initialization
and is run only once, whereas the results of the greedy approach
and NCR depend heavily on initialization, as can be seen
from the various points obtained by the greedy approach. We
also present the performance of OPTA as a benchmark, while
noting that it is asymptotic and requires infinite delay. The
performance of the linear encoder and decoder is plotted as well,
since it is also a local minimum (see Remark 1).

An example mapping from the same setting is also given in
Figure 5. Interestingly, as noted before (see e.g., [5], [6]), the
analog mapping captures the central characteristic observed
in digital Wyner-Ziv mappings, in the sense of many-to-one
mappings, where multiple source intervals are mapped to the
same channel interval, which will potentially be resolved by
the decoder given the side information. However, we see
differences between the mappings obtained by NCR (see
e.g., [5], [6]) and the ones obtained by the proposed DA-based
method, e.g., the linear trend of the encoding mapping, that yield
significant performance improvement as shown in Figure 4.
Such differences are difficult to obtain and very important for
the design of parametric mappings, see e.g., [4].

VI. Conclusions 

In this paper, we studied the problem of finding globally
optimum encoder and decoder pairs in zero-delay source-channel
coding, focusing on the setting where side information is
available to the decoder. Since the cost surface is riddled with
locally optimum points, we developed a method based on the
deterministic annealing approach to obtain globally optimum
points. The numerical results show superiority of the proposed
algorithm over greedy optimization methods as well as over the
previously adopted approach, i.e., NCR. As future work, we
will investigate adapting our DA approach to obtain optimal
mappings in more complicated network settings, as well as in
well-known open control problems such as Witsenhausen's
counterexample.